
[https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails #13172

Open

ziyixiong-nv wants to merge 2 commits into NVIDIA:main from ziyixiong-nv:repair-bot-bug5859886

Conversation

@ziyixiong-nv
Collaborator

@ziyixiong-nv ziyixiong-nv commented Apr 18, 2026

Summary

  • Fix for NVBugs 5859886: [TensorRT-LLM][L0][Post-Merge][main]accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_fp8_blockscale[disable_skip_indexer] timeout
  • Root cause: Skip DeepEP when NVLink symmetric memory init fails
  • Fix: (auto-detected from git commit)
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes

    • Improved NVLink workspace initialization failure handling with graceful fallback mechanism to prevent repeated initialization attempts on known failures.
  • Tests

    • Removed skip waiver for a DeepSeek test case.

@ziyixiong-nv ziyixiong-nv requested a review from a team as a code owner April 18, 2026 00:26
@ziyixiong-nv ziyixiong-nv requested a review from yuxianq April 18, 2026 00:26
@coderabbitai
Contributor

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

The changes add failure tracking and error handling for NVLink workspace initialization in the MOE communication module. A new class-level flag records initialization failures, and the strategy factory conditionally skips DeepEP strategies when this flag is set. A test waiver entry is removed.

Changes

MOE Communication Error Handling
Files: tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py, tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
Added a class-level flag _WORKSPACE_INIT_FAILED to track NVLink initialization failures. nvlink_one_sided.py now wraps workspace allocation in try/except to catch RuntimeError and AssertionError, setting the flag and re-raising on failure. communication_factory.py conditionally bypasses DeepEP/DeepEPLowLatency strategy selection when the flag is set, falling back to AllGatherReduceScatter.

Test Waivers
Files: tests/integration/test_lists/waives.txt
Removed the waiver entry for accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_fp8_blockscale[disable_skip_indexer].
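The failure-caching pattern summarized above can be sketched as follows. This is a minimal illustration, not the actual TensorRT-LLM code: the method and helper names (`_init_workspace`, `_allocate_symmetric_memory`) are assumptions standing in for the real `MnnvlMemory.initialize()` / `moe_a2a_initialize()` calls.

```python
# Sketch of the class-level failure flag described in the walkthrough.
# Names are illustrative; the real classes live under
# tensorrt_llm/_torch/modules/fused_moe/communication/.
from typing import ClassVar, Optional


class NVLinkOneSided:
    # Class-level state shared by every MoE layer in the process, so one
    # failed initialization is remembered across all subsequent layers.
    _WORKSPACE: ClassVar[Optional[object]] = None
    _WORKSPACE_INIT_FAILED: ClassVar[bool] = False

    def _init_workspace(self) -> None:
        if NVLinkOneSided._WORKSPACE_INIT_FAILED:
            # A previous attempt already failed on this node: fail fast
            # instead of retrying an allocation known to error out or hang.
            raise RuntimeError("NVLink workspace init previously failed")
        if NVLinkOneSided._WORKSPACE is not None:
            return  # workspace already initialized, nothing to do
        try:
            NVLinkOneSided._WORKSPACE = self._allocate_symmetric_memory()
        except (RuntimeError, AssertionError):
            # Record the failure so later layers (and the strategy factory)
            # can skip NVLink/DeepEP paths, then re-raise to the caller.
            NVLinkOneSided._WORKSPACE_INIT_FAILED = True
            NVLinkOneSided._WORKSPACE = None
            raise

    def _allocate_symmetric_memory(self) -> object:
        # Hypothetical stand-in for MnnvlMemory.initialize() plus
        # moe_a2a_initialize(); simulates an unsupported node here.
        raise RuntimeError("no NVLink symmetric memory on this node")
```

Because the flag is class-level rather than per-instance, the expensive (and possibly hanging) allocation is attempted at most once per process.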

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 inconclusive

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive. The PR description includes a summary, test plan, and links, but lacks a detailed explanation of what changes were made and why. Resolution: expand the description to explain the root cause, the implementation approach, and how the fix resolves the timeout issue in more detail.

✅ Passed checks (2 passed)

  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Title check — ✅ Passed. The title clearly and specifically summarizes the main change: skipping DeepEP when NVLink symmetric memory initialization fails, which directly addresses the root cause of the timeout issue.


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py (1)

230-237: Consider checking _WORKSPACE_INIT_FAILED before MnnvlMemory.initialize().

The failure check (line 232) occurs after MnnvlMemory.initialize() (line 230). If MnnvlMemory.initialize() is safe to call repeatedly regardless of workspace state, this is fine. However, moving the check earlier would be slightly more defensive.

♻️ Optional: Move failure check before MnnvlMemory.initialize()
+        if self._WORKSPACE_INIT_FAILED:
+            raise RuntimeError(
+                "NVLinkOneSided: workspace initialization (MNNVL/NVSHMEM) previously "
+                "failed on this node, skipping repeated initialization attempt."
+            )
+
         # Initialize or reuse workspace
         MnnvlMemory.initialize()
 
-        if self._WORKSPACE_INIT_FAILED:
-            raise RuntimeError(
-                "NVLinkOneSided: workspace initialization (MNNVL/NVSHMEM) previously "
-                "failed on this node, skipping repeated initialization attempt."
-            )
-
         if self._WORKSPACE is None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py`
around lines 230 - 237, The _WORKSPACE_INIT_FAILED flag should be checked before
attempting to re-initialize workspace resources; move the check for
self._WORKSPACE_INIT_FAILED to immediately before the call to
MnnvlMemory.initialize() in NVLinkOneSided so that if the workspace init
previously failed we raise the RuntimeError early and skip calling
MnnvlMemory.initialize(); keep the same RuntimeError message and behavior
otherwise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ef44833d-2b45-4f1e-8224-51e047ee2be3

📥 Commits

Reviewing files that changed from the base of the PR and between 813d877 and 8f778a3.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@ziyixiong-nv ziyixiong-nv changed the title from "[https://nvbugs/5859886][fix] [TensorRT-LLM][L0][Post-Merge][main]accuracy/test_…" to "[https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails" on Apr 18, 2026
…ails

When NVLinkOneSided workspace initialization fails (MNNVL allocation or
NVSHMEM moe_a2a_initialize), cache the failure and propagate it to skip
DeepEP/DeepEPLowLatency strategies in CommunicationFactory. DeepEP also
relies on NVSHMEM internally and would hang during forward pass if the
NVLink symmetric memory infrastructure is unavailable.

Changes:
- NVLinkOneSided: wrap workspace init (MnnvlMemory + moe_a2a_initialize)
  in try-except, set _WORKSPACE_INIT_FAILED on failure to avoid repeated
  attempts across MoE layers and signal the factory.
- CommunicationFactory: check _WORKSPACE_INIT_FAILED before trying
  DeepEP/DeepEPLowLatency; fall through to AllGatherReduceScatter (NCCL).
- Remove test waiver for test_fp8_blockscale[disable_skip_indexer].

Signed-off-by: Ziyi Xiong <219238287+ziyixiong-nv@users.noreply.github.com>
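The factory-side fallback described in the commit message above can be sketched as follows. The enum and function names here are illustrative assumptions drawn from the PR summary, not the actual CommunicationFactory API.

```python
# Sketch of the strategy-selection fallback: when NVLink symmetric memory
# init has failed, DeepEP variants (which also rely on NVSHMEM internally
# and would hang in the forward pass) are skipped in favor of the
# NCCL-based AllGatherReduceScatter path.
from enum import Enum, auto


class Strategy(Enum):
    DEEP_EP = auto()
    DEEP_EP_LOW_LATENCY = auto()
    ALL_GATHER_REDUCE_SCATTER = auto()


def select_strategy(preferred: Strategy, workspace_init_failed: bool) -> Strategy:
    """Return the communication strategy to use for a MoE layer.

    If the NVLink symmetric memory infrastructure is known to be broken
    (workspace_init_failed is True), DeepEP/DeepEPLowLatency are bypassed
    and the factory falls through to AllGatherReduceScatter.
    """
    deep_ep_variants = {Strategy.DEEP_EP, Strategy.DEEP_EP_LOW_LATENCY}
    if workspace_init_failed and preferred in deep_ep_variants:
        return Strategy.ALL_GATHER_REDUCE_SCATTER
    return preferred
```

The key design point is that the check happens at strategy-selection time, before any DeepEP object is constructed, so a node without working symmetric memory never enters the code path that would hang.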
@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug5859886 branch from 8f778a3 to ef935a1 Compare April 20, 2026 02:22
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44256 [ run ] triggered by Bot. Commit: ef935a1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44256 [ run ] completed with state FAILURE. Commit: ef935a1
/LLM/main/L0_MergeRequest_PR pipeline #34677 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44339 [ run ] triggered by Bot. Commit: ef935a1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44339 [ run ] completed with state FAILURE. Commit: ef935a1
/LLM/main/L0_MergeRequest_PR pipeline #34756 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Comment thread on tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py (outdated)
- Update copyright year to 2025-2026
- Mark _WORKSPACE and _WORKSPACE_INIT_FAILED as ClassVar
- Move _WORKSPACE_INIT_FAILED check before MnnvlMemory.initialize()
- Release workspace/mnnvl_mem on failure to prevent CUDA memory leak
- Broaden exception handling to match PR NVIDIA#13235 pattern

Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44865 [ run ] triggered by Bot. Commit: dd0769e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44865 [ run ] completed with state SUCCESS. Commit: dd0769e
/LLM/main/L0_MergeRequest_PR pipeline #35203 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bobboli bobboli requested review from bobboli and xxi-nv April 22, 2026 08:26
@bobboli
Collaborator

bobboli commented Apr 22, 2026

Please hold on: I find that the phenomenon reproduced by the repair bot differs from the original bug description.

# be broken (detected by NVLinkOneSided workspace init failure). DeepEP also
# relies on NVSHMEM/symmetric memory internally, so it would hang during
# forward pass if the NVLink memory infrastructure is unavailable.
if NVLinkOneSided._WORKSPACE_INIT_FAILED:
Collaborator


Detecting whether NVLink symmetric memory is supported shouldn't be the responsibility of a specific communication backend; relying on one backend's init failure for this is not reliable.
